Abstract

We study the information-theoretic limits of exactly recovering the support of a sparse sig- 
nal using noisy projections defined by various classes of measurement matrices. Our analysis 
is high-dimensional in nature, in which the number of observations n, the ambient signal di- 
mension p, and the signal sparsity k are all allowed to tend to infinity in a general manner. 
This paper makes two novel contributions. First, we provide sharper necessary conditions for 
exact support recovery using general (non-Gaussian) dense measurement matrices. Combined
with previously known sufficient conditions, this result yields sharp characterizations of when
the optimal decoder can recover a signal for various scalings of the sparsity $k$ and sample size $n$,
including the important special case of linear sparsity ($k = \Theta(p)$) using a linear scaling of
observations ($n = \Theta(p)$). Our second contribution is to prove necessary conditions on the number of
observations $n$ required for asymptotically reliable recovery using a class of $\gamma$-sparsified
measurement matrices, where the measurement sparsity $\gamma(n, p, k) \in (0, 1]$ corresponds to the
fraction of non-zero entries per row. Our analysis allows general scaling of the quadruplet $(n, p, k, \gamma)$, and
reveals three different regimes, corresponding to whether measurement sparsity has no effect, 
a minor effect, or a dramatic effect on the information-theoretic limits of the subset recovery 
problem. 

Keywords: Sparsity recovery; sparse random matrices; subset selection; compressive sensing; 
signal denoising; sparse approximation; information-theoretic bounds; Fano's inequality. 

1 Introduction 

The problem of estimating a $k$-sparse vector $\beta \in \mathbb{R}^p$ based on a set of $n$ noisy linear observations is
of broad interest, arising in subset selection in regression, graphical model selection, group testing,
signal denoising, sparse approximation, and compressive sensing. A large body of recent work
(e.g., [?]) has analyzed the use of $\ell_1$-relaxation methods
for estimating high-dimensional sparse signals, and established conditions (on signal sparsity and
the choice of measurement matrices) under which they succeed with high probability.

Of complementary interest are the information-theoretic limits of the sparsity recovery problem, 
which apply to the performance of any procedure regardless of its computational complexity. Such 
analysis has two purposes: first, to demonstrate where known polynomial-time methods achieve
the information-theoretic bounds, and second, to reveal situations in which current methods are 
sub-optimal. An interesting question which arises in this context is the effect of the choice of 
measurement matrix on the information-theoretic limits of sparsity recovery. As we will see, the 
standard Gaussian measurement ensemble is an optimal choice in terms of minimizing the number 
of observations required for recovery. However, this choice produces highly dense measurement 
matrices, which may lead to prohibitively high computational complexity and storage requirements. 
Sparse matrices can reduce this complexity, and also lower communication cost and latency in 
distributed network and streaming applications. On the other hand, such measurement sparsity, 
though beneficial from the computational standpoint, may reduce statistical efficiency by requiring 
more observations to decode. Therefore, an important issue is to characterize the trade-off between 
measurement sparsity and statistical efficiency.

With this motivation, this paper makes two contributions. First, we derive sharper necessary 
conditions for exact support recovery, applicable to a general class of dense measurement matrices 
(including non-Gaussian ensembles). In conjunction with the sufficient conditions from previous 
work [23], this analysis provides a sharp characterization of necessary and sufficient conditions 
for various sparsity regimes. Our second contribution is to address the effect of measurement 
sparsity, meaning the fraction $\gamma \in (0, 1]$ of non-zeros per row in the matrices used to collect
measurements. We derive lower bounds on the number of observations required for exact sparsity
recovery, as a function of the signal dimension $p$, signal sparsity $k$, and measurement sparsity $\gamma$.
This analysis highlights a trade-off between the statistical efficiency of a measurement ensemble 
and the computational complexity associated with storing and manipulating it. 

The remainder of the paper is organized as follows. We first define our problem formulation in
Section 1.1, and then discuss our contributions and some connections to related work in Section 1.2.
Section 2 provides precise statements of our main results, as well as a discussion of their
consequences. Section 3 provides proofs of the necessary conditions for various classes of measurement
matrices, while proofs of more technical lemmas are given in the appendices. Finally, we conclude
and discuss open problems in Section 4.



1.1 Problem formulation 

There are a variety of problem formulations in the growing body of work on compressive sensing 
and related areas. The signal model may be exactly sparse, approximately sparse, or compressible 
(i.e. that the signal is approximately sparse in some orthonormal basis). The most common signal 
model is a deterministic one, although Bayesian formulations are also possible. In addition, the 
observation model can be either noiseless or noisy, and the measurement matrix can be random or 
deterministic. Furthermore, the signal recovery can be perfect or approximate, assessed by various 
error metrics (e.g., $\ell_q$-norms, prediction error, subset recovery).

In this paper, we consider a deterministic signal model, in which $\beta \in \mathbb{R}^p$ is a fixed but unknown
vector with exactly $k$ non-zero entries. We refer to $k$ as the signal sparsity and $p$ as the signal
dimension, and define the support set of $\beta$ as

$$S := \{ i \in \{1, \ldots, p\} \mid \beta_i \neq 0 \}. \qquad (1)$$

Note that there are $N = \binom{p}{k}$ possible support sets, corresponding to the $N$ possible $k$-dimensional
subspaces in which $\beta$ can lie. We are given a vector of $n$ noisy observations $Y \in \mathbb{R}^n$, of the form

$$Y = X\beta + W, \qquad (2)$$

where $X \in \mathbb{R}^{n \times p}$ is the measurement matrix, and $W \sim N(0, \sigma^2 I_{n \times n})$ is additive Gaussian noise.
Our results apply to various classes of dense and $\gamma$-sparsified measurement matrices, which will be
defined concretely in Section 2. Throughout this paper, we assume without loss of generality that
$\sigma^2 = 1$, since any scaling of $\sigma$ can be accounted for in the scaling of $\beta$.
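As a concrete illustration of the observation model (2), the following minimal sketch draws one problem instance with a standard Gaussian measurement matrix and unit noise variance. The function name `sample_observations`, the zero-based indexing, and the choice of setting every non-zero coefficient to the same value are assumptions made for illustration only.

```python
import random

def sample_observations(n, p, support, beta_min, rng):
    """Return (X, beta, Y) with Y = X beta + W, for i.i.d. N(0, 1) entries in X and W."""
    # Signal that equals beta_min on the support and zero elsewhere (illustrative choice).
    beta = [beta_min if j in support else 0.0 for j in range(p)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Unit-variance Gaussian noise, matching the normalization sigma^2 = 1.
    W = [rng.gauss(0.0, 1.0) for _ in range(n)]
    Y = [sum(X[i][j] * beta[j] for j in range(p)) + W[i] for i in range(n)]
    return X, beta, Y

rng = random.Random(0)
X, beta, Y = sample_observations(n=5, p=8, support={1, 4}, beta_min=0.5, rng=rng)
```

Here $k = 2$ and $p = 8$, so there are $\binom{8}{2} = 28$ candidate support sets that a decoder would have to distinguish.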

Our goal is to perform exact recovery of the support set $S$, which corresponds to a standard
model selection error criterion. More precisely, we measure the error between the estimate $\hat{\beta}$ and
the true signal $\beta$ using the $\{0, 1\}$-valued loss function:

$$\rho(\hat{\beta}, \beta) := \mathbb{I}\left[ \{\hat{\beta}_i \neq 0, \; \forall i \in S\} \cap \{\hat{\beta}_j = 0, \; \forall j \notin S\} \right]. \qquad (3)$$
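The exact-recovery event inside the indicator in (3) compares only the supports of the estimate and the true signal, not the coefficient values. A minimal sketch of that check is given below; the helper names `support` and `exact_recovery` are hypothetical.

```python
def support(beta):
    """Indices of the non-zero entries of a coefficient vector."""
    return {i for i, b in enumerate(beta) if b != 0.0}

def exact_recovery(beta_hat, beta):
    """True exactly when the estimate is non-zero on S and zero off S."""
    return support(beta_hat) == support(beta)

beta_true = [0.0, 0.3, 0.0, 5.0]
good_estimate = [0.0, 2.0, 0.0, -1.0]   # wrong values, correct support
bad_estimate = [0.0, 2.0, 1.0, 0.0]     # one false positive, one miss
```

Note that `good_estimate` counts as a success even though its coefficient values are far from the truth, which is why the criterion is one of model selection rather than estimation.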

The results of this paper apply to arbitrary decoders. Any decoder is a mapping $g$ from the
observations $Y$ to an estimated subset $\hat{S} = g(Y)$. Let $\mathbb{P}[g(Y) \neq S \mid S]$ be the conditional probability of
error given that the true support is $S$. Assuming that $\beta$ has support $S$ chosen uniformly at random
over the $N$ possible subsets of size $k$, the average probability of error is given by

$$p_{\mathrm{err}} = \frac{1}{\binom{p}{k}} \sum_{S} \mathbb{P}[g(Y) \neq S \mid S]. \qquad (4)$$

We say that sparsity recovery is asymptotically reliable if $p_{\mathrm{err}} \to 0$ as $n \to \infty$. Since we are
trying to recover the support exactly from noisy measurements, our results necessarily involve the
minimum value of $\beta$ on its support,

$$\beta_{\min} := \min_{i \in S} |\beta_i|. \qquad (5)$$

In particular, our results apply to decoders that operate over the signal class 

$$\mathcal{C}(\beta_{\min}) := \{ \beta \in \mathbb{R}^p \mid |\beta_i| \geq \beta_{\min} \;\; \forall i \in S \}. \qquad (6)$$

With this set-up, our goal is to find necessary conditions on the parameters $(n, p, k, \beta_{\min}, \gamma)$ that
any decoder, regardless of its computational complexity, must satisfy for asymptotically reliable
recovery to be possible. We are interested in lower bounds on the number of measurements $n$, in
general settings where both the signal sparsity $k$ and the measurement sparsity $\gamma$ are allowed to
scale with the signal dimension $p$. As our analysis shows, the appropriate notion of rate for this
problem is $R = \frac{\log \binom{p}{k}}{n}$.

1.2 Our contributions 

One body of past work [?] has focused on the information-theoretic limits of sparse estimation
under $\ell_2$ and other distortion metrics, using power-based SNR measures of the form

$$\mathrm{SNR} := \frac{\mathbb{E}[\|X\beta\|_2^2]}{\mathbb{E}[\|W\|_2^2]} = \|\beta\|_2^2. \qquad (7)$$



(Note that the second equality assumes that the noise variance $\sigma^2 = 1$, and that the measurement
matrix is standardized, with each element $X_{ij}$ having zero-mean and variance one.) It is important
to note that the power-based SNR (7), though appropriate for $\ell_2$-distortion, is not suitable for the
support recovery problem. Although the minimum value is related to this power-based measure
by the inequality $k\beta_{\min}^2 \leq \mathrm{SNR}$, for the ensemble of signals $\mathcal{C}(\beta_{\min})$ defined in equation (6), the
$\ell_2$-based SNR measure (7) can be made arbitrarily large, while still having one coefficient $\beta_i$ equal
to the minimum value (assuming that $k > 1$). Consequently, as our results show, it is possible
to generate problem instances for which support recovery is arbitrarily difficult (in particular, by
sending $\beta_{\min} \to 0$ at an arbitrarily rapid rate) even as the power-based SNR (7) becomes arbitrarily
large.
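The gap between the power-based SNR and the minimum value can be seen on a toy example. The sketch below uses two hypothetical signals with the same support size and the same $\beta_{\min}$; only one large coefficient differs. The helper names are illustrative.

```python
def power_snr(beta):
    """Power-based SNR ||beta||_2^2 under unit noise variance and a standardized X."""
    return sum(b * b for b in beta)

def min_value(beta):
    """Smallest magnitude of beta on its support."""
    return min(abs(b) for b in beta if b != 0.0)

beta_flat = [0.1, 0.1, 0.0, 0.0]     # both non-zeros sit at the minimum value
beta_spiky = [0.1, 100.0, 0.0, 0.0]  # same minimum, one very large coefficient
```

Both signals have the same minimum value, so locating the small coefficient is equally hard in both cases, yet `power_snr(beta_spiky)` exceeds `power_snr(beta_flat)` by more than five orders of magnitude.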

The paper [21] was the first to consider the information-theoretic limits of exact subset recovery
using dense Gaussian measurement ensembles, explicitly identifying the minimum value $\beta_{\min}$ as
the key parameter. This analysis yielded necessary and sufficient conditions on general quadruples
$(n, p, k, \beta_{\min})$ for asymptotically reliable recovery. Subsequent work [?] has extended this type
of analysis to the criterion of partial support recovery. In this paper, we consider only exact support
recovery, but provide results for general dense measurement ensembles, thereby extending previous
results. In conjunction with known sufficient conditions [21], one consequence of our first main result
(Theorem 1 below) is a set of sharp necessary and sufficient conditions for the optimal decoder
to recover the support of a signal with linear sparsity ($k = \Theta(p)$), using only a linear fraction
of observations ($n = \Theta(p)$). Moreover, for the special case of the standard Gaussian ensemble,
Theorem 1 also recovers some results independently obtained in concurrent work by Reeves [16]
and Fletcher et al.

We then consider the effect of measurement sparsity, which we assess in terms of the fraction $\gamma \in (0, 1]$
of non-zeros per row of the measurement matrix $X$. Some past work in compressive sensing
has proposed computationally efficient recovery methods based on sparse measurement matrices,
including work inspired by expander graphs and coding theory [26, 19], sparse random projections
for Johnson-Lindenstrauss embeddings [25], and sketching and group testing [13, 7]. All of this work
deals with the noiseless observation model, in contrast to the noisy observation model (2) considered
here. The paper [1] provides results on sparse measurements for noisy problems and distortion-type
error metrics, using a Bayesian signal model and power-based SNR that is not appropriate
for the subset recovery problem. Also, some concurrent work [15] provides sufficient conditions
for support recovery using the Lasso ($\ell_1$-constrained quadratic programming) for appropriately
sparsified ensembles. These results can be viewed as complementary to the information-theoretic
analysis of this paper. In this paper, we characterize the inherent trade-off between measurement
sparsity and statistical efficiency. More specifically, our second main result (Theorem 2 below)
provides necessary conditions for exact support recovery, using $\gamma$-sparsified Gaussian measurement
matrices (see equation (8)), for general scalings of the parameters $(n, p, k, \beta_{\min}, \gamma)$. This analysis
reveals three regimes of interest, corresponding to whether measurement sparsity has no effect, a
small effect, or a significant effect on the number of measurements necessary for recovery. Thus,
there exist regimes in which measurement sparsity fundamentally alters the ability of any method
to decode.

2 Main results and consequences 

In this section, we state our main results, and discuss some of their consequences. Our analysis
applies to random ensembles of measurement matrices $X \in \mathbb{R}^{n \times p}$, where each entry $X_{ij}$ is drawn
i.i.d. from some underlying distribution. The most commonly studied random ensemble is the
standard Gaussian case, in which each $X_{ij} \sim N(0, 1)$. Note that this choice generates a highly
dense measurement matrix $X$, with $np$ non-zero entries. Our first result (Theorem 1) applies to
more general ensembles that satisfy the moment conditions $\mathbb{E}[X_{ij}] = 0$ and $\mathrm{var}(X_{ij}) = 1$, which
allows for a variety of non-Gaussian distributions (e.g., uniform, Bernoulli, etc.). In addition, we
also derive results (Theorem 2) for $\gamma$-sparsified matrices $X$, in which each entry $X_{ij}$ is i.i.d. drawn
according to

$$X_{ij} \sim \begin{cases} N(0, 1/\gamma) & \text{w.p. } \gamma \\ 0 & \text{w.p. } 1 - \gamma. \end{cases} \qquad (8)$$

Note that when $\gamma = 1$, $X$ is exactly the standard Gaussian ensemble. We refer to the sparsification
parameter $0 < \gamma \leq 1$ as the measurement sparsity. Our analysis allows this parameter to vary as a
function of $(n, p, k)$.
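A short sampling sketch makes the normalization of the sparsified ensemble concrete. It assumes that the non-zero entries are drawn from $N(0, 1/\gamma)$, which is the scaling that keeps $\mathrm{var}(X_{ij}) = 1$ for every $\gamma$ and reduces to the standard Gaussian ensemble at $\gamma = 1$; the function name `sample_sparsified_row` is hypothetical.

```python
import math
import random

def sample_sparsified_row(p, gamma, rng):
    """Draw p i.i.d. entries: N(0, 1/gamma) with probability gamma, else exactly 0."""
    scale = 1.0 / math.sqrt(gamma)  # standard deviation of the non-zero entries
    return [rng.gauss(0.0, scale) if rng.random() < gamma else 0.0 for _ in range(p)]

rng = random.Random(1)
entries = sample_sparsified_row(p=200_000, gamma=0.1, rng=rng)
frac_nonzero = sum(1 for x in entries if x != 0.0) / len(entries)
second_moment = sum(x * x for x in entries) / len(entries)
```

With $\gamma = 0.1$, roughly one entry in ten is non-zero, while the empirical second moment stays close to one, so sparsification changes the shape of the entry distribution but not its variance.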

2.1 Tighter bounds on dense ensembles 

We begin by noting an analogy to the Gaussian channel coding problem that yields a straightforward
but loose set of necessary conditions. Support recovery can be viewed as a channel coding problem,
in which there are $N = \binom{p}{k}$ possible support sets of $\beta$, corresponding to messages to be sent over
a Gaussian channel with noise variance 1. The effective code rate is then $R = \frac{\log \binom{p}{k}}{n}$. If each
support set $S$ is encoded as the codeword $c(S) = X\beta$, where $X$ has i.i.d. Gaussian entries, then by
standard Gaussian channel capacity results, we immediately obtain a lower bound on the number
of observations $n$ necessary for asymptotically reliable recovery,

$$n > \frac{\log \binom{p}{k}}{\frac{1}{2} \log(1 + \|\beta\|_2^2)}. \qquad (9)$$
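The channel coding bound (9) is easy to evaluate numerically. The sketch below assumes natural logarithms throughout and computes $\log \binom{p}{k}$ through `math.lgamma` so that large $p$ does not overflow; the function names are illustrative.

```python
import math

def log_binom(p, k):
    """Natural log of the binomial coefficient C(p, k)."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)

def channel_coding_bound(p, k, beta):
    """Right-hand side of (9): log C(p, k) divided by the Gaussian capacity term."""
    power = sum(b * b for b in beta)
    return log_binom(p, k) / (0.5 * math.log(1.0 + power))

# Single non-zero of unit magnitude among p = 100 coordinates.
bound_k1 = channel_coding_bound(100, 1, [1.0])
```

For $k = 1$ the numerator reduces to $\log p$, so the bound asks for roughly $2 \log p / \log(1 + \beta_{\min}^2)$ observations.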

This bound is tight for $k = 1$ and Gaussian measurements, but loose in general. As Theorem 1
clarifies, there are additional elements in the support recovery problem that distinguish it from
a standard Gaussian coding problem: first, the signal power $\|\beta\|_2^2$ does not capture the inherent
problem difficulty for $k > 1$, and second, there is overlap between support sets for $k > 1$. The
following result provides sharper conditions on subset recovery.

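Theorem 1 (General ensembles). Let the measurement matrix $X \in \mathbb{R}^{n \times p}$ be drawn with i.i.d.
elements from any distribution with zero-mean and variance one. Then a necessary condition for
asymptotically reliable recovery over the signal class $\mathcal{C}(\beta_{\min})$ is

$$n > \max\{ f_1(p, k, \beta_{\min}), \; f_2(p, k, \beta_{\min}), \; k - 1 \}, \qquad (10)$$

where

$$f_1(p, k, \beta_{\min}) := \frac{\log \binom{p}{k} - 1}{\frac{1}{2} \log\left(1 + k\beta_{\min}^2 \left(1 - \frac{k}{p}\right)\right)}, \qquad (11a)$$

$$f_2(p, k, \beta_{\min}) := \frac{\log(p - k + 1) - 1}{\frac{1}{2} \log\left(1 + \beta_{\min}^2 \left(1 - \frac{1}{p - k + 1}\right)\right)}. \qquad (11b)$$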
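The necessary condition of Theorem 1 can be evaluated directly. The sketch below assumes the forms $f_1 = (\log\binom{p}{k} - 1) / (\tfrac{1}{2}\log(1 + k\beta_{\min}^2(1 - k/p)))$ and $f_2 = (\log(p-k+1) - 1) / (\tfrac{1}{2}\log(1 + \beta_{\min}^2(1 - 1/(p-k+1))))$, which are the expressions implied by the averaged covariances in Lemmas 1 and 2, and uses natural logarithms. Function names are hypothetical.

```python
import math

def log_binom(p, k):
    """Natural log of the binomial coefficient C(p, k)."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)

def f1(p, k, beta_min_sq):
    """Bulk term: many competing supports at large distances."""
    return (log_binom(p, k) - 1.0) / (0.5 * math.log(1.0 + k * beta_min_sq * (1.0 - k / p)))

def f2(p, k, beta_min_sq):
    """Local term: p - k + 1 supports that differ in a single index."""
    m = p - k + 1
    return (math.log(m) - 1.0) / (0.5 * math.log(1.0 + beta_min_sq * (1.0 - 1.0 / m)))

def necessary_observations(p, k, beta_min_sq):
    """Largest of the three lower bounds on n in Theorem 1."""
    return max(f1(p, k, beta_min_sq), f2(p, k, beta_min_sq), k - 1)

n_lower = necessary_observations(p=1000, k=10, beta_min_sq=0.1)
```

Shrinking `beta_min_sq` inflates `f2` roughly like $1/\beta_{\min}^2$, which is how the minimum value, rather than the total signal power, enters the bound.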



The proof of Theorem 1, given in Section 3, uses Fano's inequality to bound the probability
of error of any recovery method. In addition to the standard Gaussian ensemble ($X_{ij} \sim N(0, 1)$),
this result also covers matrices from other common ensembles (e.g., Bernoulli $X_{ij} \in \{-1, +1\}$). It
generalizes and strengthens earlier results on subset recovery [?]. Note that $\|\beta\|_2^2 \geq k\beta_{\min}^2$ (with
equality in the case when $|\beta_i| = \beta_{\min}$ for all indices $i \in S$), so that this bound is strictly tighter
than the intuitive bound (9). Moreover, by fixing the value of $\beta$ at $(k - 1)$ indices to $\beta_{\min}$ and
allowing the last component of $\beta$ to tend to infinity, we can drive the power $\|\beta\|_2^2$ to infinity, while
still having the minimum $\beta_{\min}$ enter the lower bound.

Regime | Necessary conditions (Theorem 1) | Sufficient conditions (Wainwright [23])
------ | ------ | ------
$k = \Theta(p)$, $\beta_{\min}^2 = \Theta(1/k)$ | $\Theta(p \log p)$ | $\Theta(p \log p)$
$k = \Theta(p)$, $\beta_{\min}^2 = \Theta(\log k / k)$ | $\Theta(p)$ | $\Theta(p)$
$k = \Theta(p)$, $\beta_{\min}^2 = \Theta(1)$ | $\Theta(p)$ | $\Theta(p)$
$k = o(p)$, $\beta_{\min}^2 = \Theta(1/k)$ | $\Theta(k \log(p - k))$ | $\Theta(k \log(p - k))$
$k = o(p)$, $\beta_{\min}^2 = \Theta(\log k / k)$ | $\max\{\Theta(k \log(p/k) / \log\log k), \; \Theta(k \log(p - k) / \log k)\}$ | $\Theta(k \log(p/k))$
$k = o(p)$, $\beta_{\min}^2 = \Theta(1)$ | $\max\{\Theta(k \log(p/k) / \log k), \; \Theta(k)\}$ | $\Theta(k \log(p/k))$

Table 1. Tight necessary and sufficient conditions on the number of observations $n$ required for
exact support recovery are obtained in several regimes of interest.

The necessary conditions in Theorem 1 can be compared against the sufficient conditions in
Wainwright [24] for exact support recovery using the standard Gaussian ensemble, as shown in
Table 1. We obtain tight necessary and sufficient conditions in the regime of linear signal sparsity
(meaning $k/p = \alpha$ for some $\alpha \in (0, 1)$), under various scalings of the minimum value $\beta_{\min}$. We
also obtain tight matching conditions in the regime of sublinear signal sparsity (in which $k/p \to 0$),
when $k\beta_{\min}^2 = \Theta(1)$. There remains a slight gap, however, in the sublinear sparsity regime when
$k\beta_{\min}^2 \to \infty$ (see bottom two rows in Table 1). Moreover, these information-theoretic bounds can
be compared to the recovery threshold of $\ell_1$-constrained quadratic programming, known as the
Lasso [23]. This comparison reveals that whenever $k\beta_{\min}^2 = \Theta(1)$ (in both the linear and sublinear
sparsity regimes), then $\Theta(k \log(p - k))$ observations are necessary and sufficient for sparsity recovery,
and hence the Lasso method is information-theoretically optimal. In contrast, when $k\beta_{\min}^2 \to \infty$
and $k/p = \alpha$, there is a gap between the performance of the Lasso and the information-theoretic
bounds.

Theorem 1 has some consequences related to results proved in concurrent work. Reeves and
Gastpar [16] have shown that in the regime of linear sparsity $k/p = \alpha > 0$, if any decoder is
given only a linear fraction sample size (meaning that $n = \Theta(p)$), then in order to recover the
support exactly, one must have $k\beta_{\min}^2 \to +\infty$. This result is one corollary of Theorem 1, since if
$\beta_{\min}^2 = \Theta(1/k)$, then we have

$$n > \frac{\log(p - k + 1) - 1}{\frac{1}{2} \log(1 + \Theta(1/k))} = \Omega(k \log(p - k)) \gg \Theta(p),$$

so that the scaling $n = \Theta(p)$ is precluded. In other concurrent work, Fletcher et al. [11] used direct
methods to show that for the special case of the standard Gaussian ensemble, the number of
observations must satisfy $n > \Omega\left(\frac{\log(p - k)}{\beta_{\min}^2}\right)$. This bound is a consequence of our lower bound $f_2(p, k, \beta_{\min})$;
moreover, Theorem 1 implies the same lower bound for general (non-Gaussian) ensembles as well.

In the regime of linear sparsity, Wainwright [23] showed, by direct analysis of the optimal
decoder, that the scaling $\beta_{\min}^2 = \Omega(\log(k)/k)$ is sufficient for exact support recovery using a linear
fraction $n = \Theta(p)$ of observations. Combined with the necessary condition in Theorem 1, we obtain
the following corollary that provides a sharp characterization of the linear-linear regime:

Corollary 1. Consider the regime of linear sparsity, meaning that $k/p = \alpha \in (0, 1)$, and suppose
that a linear fraction $n = \Theta(p)$ of observations are made. Then the optimal decoder can recover the
support exactly if and only if $\beta_{\min}^2 = \Omega(\log k / k)$.

2.2 Effect of measurement sparsity 

We now turn to the effect of measurement sparsity on recovery, considering in particular the
$\gamma$-sparsified ensemble (8). Even though the average signal-to-noise ratio of our channel remains
the same (since $\mathrm{var}(X_{ij}) = 1$ for all choices of $\gamma$ by construction), the Gaussian channel coding
bound (9) is clearly not tight for sparse $X$, even in the case of $k = 1$. The loss in statistical efficiency
is due to the fact that we are constraining our codebook to have a sparse structure, which may be
far from a capacity-achieving code. Theorem 1 applies to any ensemble in which the components
are zero-mean and unit variance. However, if we apply it to the $\gamma$-sparsified ensemble, it yields
lower bounds that are independent of $\gamma$. Intuitively, it is clear that the procedure of $\gamma$-sparsification
should cause deterioration in support recovery. Indeed, the following result provides refined bounds
that capture the effects of $\gamma$-sparsification. Let $\phi(\mu, \sigma^2)$ denote the Gaussian density with mean $\mu$
and variance $\sigma^2$, and define the following two mixture distributions:

$$\psi_1 := \sum_{\ell=0}^{k} \binom{k}{\ell} \gamma^{\ell} (1 - \gamma)^{k - \ell} \, \phi\left(0, \; 1 + \frac{\ell \beta_{\min}^2}{\gamma}\right), \qquad (12)$$

$$\psi_2 := \gamma \, \phi\left(0, \; 1 + \frac{\beta_{\min}^2}{\gamma}\right) + (1 - \gamma) \, \phi(0, 1). \qquad (13)$$

Furthermore, let $H(\cdot)$ denote the entropy functional. With this notation, we have the following
result.

Theorem 2 (Sparse ensembles). Let the measurement matrix $X \in \mathbb{R}^{n \times p}$ be drawn with i.i.d.
elements from the $\gamma$-sparsified Gaussian ensemble (8). Then a necessary condition for asymptotically
reliable recovery over the signal class $\mathcal{C}(\beta_{\min})$ is

$$n > \max\{ g_1(p, k, \beta_{\min}, \gamma), \; g_2(p, k, \beta_{\min}, \gamma), \; k - 1 \}, \qquad (14)$$

where

$$g_1(p, k, \beta_{\min}, \gamma) := \frac{\log \binom{p}{k} - 1}{H(\psi_1) - \frac{1}{2} \log(2\pi e)}, \qquad (15a)$$

$$g_2(p, k, \beta_{\min}, \gamma) := \frac{\log(p - k + 1) - 1}{H(\psi_2) - \frac{1}{2} \log(2\pi e)}. \qquad (15b)$$
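The entropy $H(\psi_2)$ has no closed form, but it is straightforward to approximate numerically. The sketch below assumes that $\psi_2$ is the two-component mixture that arises when a single non-zero of magnitude $\beta_{\min}$ is observed through one sparsified entry plus unit-variance noise: with probability $\gamma$ the observation is $N(0, 1 + \beta_{\min}^2/\gamma)$, and otherwise it is pure noise $N(0, 1)$. The midpoint-rule integration, the grid parameters, and the function names are illustrative choices.

```python
import math

def gaussian_pdf(y, var):
    """Density of N(0, var) at y."""
    return math.exp(-y * y / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_entropy(gamma, beta_min_sq, steps=120_000):
    """Differential entropy (nats) of gamma*N(0, 1 + b/gamma) + (1-gamma)*N(0, 1)."""
    var_hit = 1.0 + beta_min_sq / gamma
    half_width = 12.0 * math.sqrt(var_hit)  # covers 12 standard deviations of the wide part
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        y = -half_width + (i + 0.5) * h
        d = gamma * gaussian_pdf(y, var_hit) + (1.0 - gamma) * gaussian_pdf(y, 1.0)
        if d > 0.0:
            total -= d * math.log(d) * h
    return total

def g2(p, k, beta_min_sq, gamma):
    """Lower bound (log(p-k+1) - 1) / (H(psi_2) - 0.5*log(2*pi*e))."""
    excess = mixture_entropy(gamma, beta_min_sq) - 0.5 * math.log(2.0 * math.pi * math.e)
    return (math.log(p - k + 1) - 1.0) / excess

dense = g2(1000, 10, 1.0, 1.0)
sparse = g2(1000, 10, 1.0, 0.1)
```

At $\gamma = 1$ the mixture collapses to a single Gaussian with variance $1 + \beta_{\min}^2$. For $\gamma < 1$ the mixture has the same variance but is no longer Gaussian, so its entropy is smaller and the resulting lower bound on $n$ is larger.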
Figure 1. The rate $R$ is plotted using equation (14) in three regimes, depending on how the
quantity $\gamma k$ scales, where $\gamma \in [0, 1]$ denotes the measurement sparsification parameter and $k$ denotes
the signal sparsity.



The proof of Theorem 2, given in Section 3, again uses Fano's inequality, but explicitly analyzes
the effect of measurement sparsification on the distribution of the observations. The necessary
condition in Theorem 2 is plotted in Figure 1, showing distinct regimes of behavior depending on
how the quantity $\gamma k$ scales, where $\gamma \in [0, 1]$ is the measurement sparsification parameter and $k$ is
the signal sparsity index. In order to characterize the thresholds at which measurement sparsity
begins to degrade the performance of any decoder, Corollary 2 below further bounds the necessary
conditions in Theorem 2 in three cases. For any scalar $\gamma$, let $H_{\mathrm{binary}}(\gamma)$ denote the entropy of a
Ber($\gamma$) variate.

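Corollary 2 (Three regimes). The necessary conditions in Theorem 2 can be simplified as follows.

(a) In general,

$$g_1(p, k, \beta_{\min}, \gamma) \geq \frac{\log \binom{p}{k} - 1}{\frac{1}{2} \log(1 + k\beta_{\min}^2)}, \qquad (16a)$$

$$g_2(p, k, \beta_{\min}, \gamma) \geq \frac{\log(p - k + 1) - 1}{\frac{1}{2} \log(1 + \beta_{\min}^2)}. \qquad (16b)$$

(b) If $\gamma k = \tau$ for some constant $\tau$, then

$$g_1(p, k, \beta_{\min}, \gamma) \geq \frac{\log \binom{p}{k} - 1}{\frac{1}{2} \tau \log\left(1 + \frac{k\beta_{\min}^2}{\tau}\right) + C}, \qquad (17a)$$

$$g_2(p, k, \beta_{\min}, \gamma) \geq \frac{\log(p - k + 1) - 1}{\cdots}, \qquad (17b)$$

where $C = \frac{1}{2} \log\left(2\pi e \left(\tau + \frac{1}{12}\right)\right)$ is a constant.

(c) If $\gamma k \leq 1$, then

$$g_1(p, k, \beta_{\min}, \gamma) \geq \frac{\log \binom{p}{k} - 1}{\frac{1}{2} \gamma k \log\left(1 + \frac{\beta_{\min}^2}{\gamma}\right) + k H_{\mathrm{binary}}(\gamma)}, \qquad (18a)$$

$$g_2(p, k, \beta_{\min}, \gamma) \geq \frac{\log(p - k + 1) - 1}{\frac{1}{2} \gamma \log\left(1 + \frac{\beta_{\min}^2}{\gamma}\right) + H_{\mathrm{binary}}(\gamma)}. \qquad (18b)$$

Necessary conditions (Theorem 2) | $k = o(p)$ | $k = \Theta(p)$
------ | ------ | ------
$\beta_{\min}^2 = \Theta(1/k)$, $\gamma = o\left(\frac{1}{k \log k}\right)$ | $\Theta\left(\frac{k \log(p - k)}{\gamma k \log(1/\gamma)}\right)$ | $\Theta\left(\frac{p \log p}{\gamma p \log(1/\gamma)}\right)$
$\beta_{\min}^2 = \Theta(1/k)$, $\gamma = \Omega\left(\frac{1}{k \log k}\right)$ | $\Theta(k \log(p - k))$ | $\Theta(p \log p)$
$\beta_{\min}^2 = \Theta(\log k / k)$, $\gamma = o\left(\frac{1}{k \log k}\right)$ | $\Theta\left(\frac{k \log(p - k)}{\gamma k \log(1/\gamma)}\right)$ | $\Theta\left(\frac{p \log p}{\gamma p \log(1/\gamma)}\right)$
$\beta_{\min}^2 = \Theta(\log k / k)$, $\gamma = \Theta\left(\frac{1}{k \log k}\right)$ | $\Theta(k \log(p - k))$ | $\Theta(p \log p)$
$\beta_{\min}^2 = \Theta(\log k / k)$, $\gamma = \Omega(1/k)$ | $\max\{\Theta(k \log(p/k) / \log\log k), \; \Theta(k \log(p - k) / \log k)\}$ | $\Theta(p)$

Table 2. Necessary conditions on the number of observations $n$ required for exact support recovery
are shown in different regimes of the parameters $(p, k, \beta_{\min}, \gamma)$.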

Corollary 2 reveals three regimes of behavior, defined by the scaling of the measurement sparsity
$\gamma$ and the signal sparsity $k$. If $\gamma k \to \infty$ as $p \to \infty$, then the recovery threshold (16) is of the
same order as the threshold for dense measurement ensembles. In this regime, sparsifying the
measurement ensemble has no asymptotic effect on performance. In sharp contrast, if $\gamma k \to 0$
sufficiently fast as $p \to \infty$, then the recovery threshold (18) changes fundamentally compared to
the dense case. Finally, if $\gamma k = \Theta(1)$, then the recovery threshold (17) transitions between the two
extremes. Using the bounds in Corollary 2, the necessary conditions in Theorem 2 are shown in
Table 2 under different scalings of the parameters $(n, p, k, \beta_{\min}, \gamma)$. In particular, if $\gamma = o\left(\frac{1}{k \log k}\right)$
and the minimum value $\beta_{\min}^2$ does not increase with $k$, then the denominator $\gamma k \log \frac{1}{\gamma}$ goes to zero.
Hence, the number of measurements that any decoder needs in order to recover reliably increases
dramatically in this regime.



3 Proofs of our main results 

In this section, we provide the proofs of Theorems 1 and 2. Establishing necessary conditions
for exact sparsity recovery amounts to finding conditions on $(n, p, k, \beta_{\min})$ (and possibly $\gamma$) under
which the probability of error of any recovery method stays bounded away from zero as $n \to \infty$.
At a high level, our general approach is quite simple: we consider restricted problems in which the
decoder has been given some additional side information, and then apply Fano's inequality [8] to
lower bound the probability of error. In order to establish the two types of necessary conditions
(e.g., $f_1(p, k, \beta_{\min})$ versus $f_2(p, k, \beta_{\min})$), we consider two classes of restricted ensembles: one which
captures the bulk effect of having many competing subsets at large distances, and the other which
captures the effect of a smaller number of subsets at very close distances. This is illustrated in
Figure 2(a). We note that although the first restricted ensemble is a harder problem, applying Fano
to the second restricted ensemble yields a tighter analysis in some regimes. In all cases, we assume
that the support $S$ of the unknown vector $\beta \in \mathbb{R}^p$ is chosen randomly and uniformly over all
possible support sets. Throughout the remainder of the paper, we use the notation $X_j \in \mathbb{R}^n$ to
denote column $j$ of the matrix $X$, and $X_U \in \mathbb{R}^{n \times |U|}$ to denote the submatrix containing columns
indexed by set $U$. Similarly, let $\beta_U \in \mathbb{R}^{|U|}$ denote the subvector of $\beta$ corresponding to the index set $U$.

Figure 2. Illustration of restricted ensembles. (a) In restricted ensemble A, the decoder must
distinguish between $\binom{p}{k}$ support sets with an average overlap of size $\frac{k^2}{p}$, whereas in restricted ensemble
B, it must decode amongst a subset of the $k(p - k) + 1$ supports with overlap $k - 1$. (b) In restricted
ensemble B, the decoder is given the locations of the $k - 1$ largest non-zeros, and it must estimate
the location of the smallest non-zero from the $p - k + 1$ remaining possible indices.

Restricted ensemble A: In the first restricted problem, also exploited in previous work [23], we
assume that while the support set $S$ is unknown, the decoder knows a priori that $\beta_j = \beta_{\min}$ for
all $j \in S$. In other words, the decoder knows the value of $\beta$ on its support, but it does not know
the locations of the non-zeros. Conditioned on the event that $S$ is the true underlying support of
$\beta$, the observation vector $Y \in \mathbb{R}^n$ can then be written as

$$Y := \sum_{j \in S} X_j \beta_{\min} + W. \qquad (19)$$

If a decoder can recover the support of any $p$-dimensional $k$-sparse vector $\beta$, then it must be able
to recover a $k$-sparse vector that is constant on its support. Furthermore, having knowledge of
the value $\beta_{\min}$ at the decoder cannot increase the probability of error. Finally, we assume that
$\beta_j = \beta_{\min}$ for all $j \in S$ to construct the most difficult possible instance within our ensemble. Thus,
we can apply Fano's inequality to lower bound the probability of error in the restricted problem,
and so obtain a lower bound on the probability of error for the general problem. This procedure
yields the lower bounds $f_1(p, k, \beta_{\min})$ and $g_1(p, k, \beta_{\min}, \gamma)$ in Theorems 1 and 2, respectively.

Restricted ensemble B: The second restricted ensemble is designed to capture the confusable
effects of the relatively small number $(p - k + 1)$ of very close-by subsets (see Figure 2(b)). This
restricted ensemble is defined as follows. Suppose that the decoder is given the locations of all
but the smallest non-zero value of the vector $\beta$, as well as the values of $\beta$ on its support. More
precisely, let $j^*$ denote the unknown location of the smallest non-zero value of $\beta$, which we assume
achieves the minimum (i.e., $\beta_{j^*} = \beta_{\min}$), and let $T = S \setminus \{j^*\}$. Given knowledge of $(T, \beta_T, \beta_{\min})$,
the decoder may simply subtract $X_T \beta_T = \sum_{j \in T} X_j \beta_j$ from $Y$, so that it is left with the modified
$n$-vector of observations

$$\tilde{Y} := X_{j^*} \beta_{\min} + W. \qquad (20)$$

By re-ordering indices as need be, we may assume without loss of generality that $T = \{p - k + 2, \ldots, p\}$,
so that $j^* \in \{1, \ldots, p - k + 1\}$. The remaining sub-problem is to determine, given the
observations $\tilde{Y}$, the location of the single non-zero. Note that when we assume that the support
of $\beta$ is uniformly chosen over all $\binom{p}{k}$ possible subsets of size $k$, then given $T$, the location of the
remaining non-zero is uniformly distributed over $\{1, \ldots, p - k + 1\}$.
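The reduction in restricted ensemble B can be sketched in a few lines: subtract the known contribution $X_T \beta_T$ and then search over the $p - k + 1$ remaining indices. The nearest-column rule used below is one natural decoder for the single-non-zero sub-problem and is an illustrative assumption, not a procedure taken from the analysis, which holds for any decoder. Indices are zero-based, and all names are hypothetical.

```python
import random

def decode_remaining_index(X, Y, known, beta_min):
    """Subtract the known part X_T beta_T, then pick the column closest to the residual.

    `known` maps each index in T to its coefficient value (the side information).
    """
    n, p = len(X), len(X[0])
    Y_tilde = [Y[i] - sum(X[i][j] * b for j, b in known.items()) for i in range(n)]
    candidates = [j for j in range(p) if j not in known]  # the p - k + 1 remaining indices

    def squared_residual(j):
        return sum((Y_tilde[i] - X[i][j] * beta_min) ** 2 for i in range(n))

    return min(candidates, key=squared_residual), Y_tilde

rng = random.Random(0)
n, p, beta_min, j_star = 30, 8, 10.0, 2
known = {5: 20.0, 6: -15.0, 7: 12.0}  # T and beta_T, with |beta_j| > beta_min on T
beta = [0.0] * p
beta[j_star] = beta_min
for j, b in known.items():
    beta[j] = b
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
Y = [sum(X[i][j] * beta[j] for j in range(p)) + rng.gauss(0.0, 1.0) for i in range(n)]
j_hat, Y_tilde = decode_remaining_index(X, Y, known, beta_min)
```

With a large minimum value relative to the unit noise, as in this toy instance, the sub-problem is easy; the lower bounds concern the opposite regime, where $\beta_{\min}$ is small and the $p - k + 1$ candidates become hard to tell apart.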

We will now argue that analyzing the probability of error of this restricted problem gives us
a lower bound on the probability of error in the original problem. Let $\tilde{\beta} \in \mathbb{R}^{p - k + 1}$ be a vector
with exactly one non-zero. We can augment $\tilde{\beta}$ with $k - 1$ non-zeros at the end to obtain a
$p$-dimensional vector. If a decoder can recover the support of any $p$-dimensional $k$-sparse vector $\beta$,
then it can recover the support of the augmented $\tilde{\beta}$, and hence the support of $\tilde{\beta}$. Similarly, providing
the decoder with side information about the non-zero values of $\beta$ cannot increase the probability
of error. As before, we can apply Fano's inequality to lower bound the probability of error in
this restricted problem, thereby obtaining the lower bounds $f_2(p, k, \beta_{\min})$ and $g_2(p, k, \beta_{\min}, \gamma)$ in
Theorems 1 and 2, respectively.

3.1 Proof of Theorem 1

In this section, we derive the necessary conditions $f_1(p, k, \beta_{\min})$ and $f_2(p, k, \beta_{\min})$ in Theorem 1 for
the general class of measurement matrices, by applying Fano's inequality to bound the probability
of decoding error in restricted problems A and B, respectively.



3.1.1 Applying Fano to restricted ensemble A 

We first perform our analysis of the error probability for a particular instance of the random
measurement matrix $X$, and subsequently average over the ensemble of matrices. Let $\Omega$ denote
a random subset chosen uniformly at random over all $\binom{p}{k}$ subsets $S \subset \{1, \ldots, p\}$ of size $k$. The
probability of decoding error, for a given $X$, can be lower bounded by Fano's inequality as

$$p_{\mathrm{err}}(X) \geq \frac{H(\Omega \mid Y) - 1}{\log \binom{p}{k}} = 1 - \frac{I(\Omega; Y) + 1}{\log \binom{p}{k}},$$

where we have used the fact that $H(\Omega \mid Y) = H(\Omega) - I(\Omega; Y) = \log \binom{p}{k} - I(\Omega; Y)$. Thus the problem
is reduced to upper bounding the mutual information $I(\Omega; Y)$ between the random subset $\Omega$ and
the noisy observations $Y$. Since both $X$ and $\beta_{\min}$ are known and fixed, the mutual information
can be expanded as

$$I(\Omega; Y) = H(Y) - H(Y \mid \Omega) = H(Y) - H(W).$$
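The Fano step turns any upper bound on the mutual information into a lower bound on the error probability. A minimal sketch of that conversion, in the form $1 - (I + 1)/\log M$ with $M$ equally likely hypotheses and natural logarithms, is shown below; the function name and the clipping at zero are illustrative additions.

```python
import math

def fano_lower_bound(mutual_info, num_hypotheses):
    """Lower bound 1 - (I + 1) / log(M) on the error probability, clipped at zero."""
    return max(0.0, 1.0 - (mutual_info + 1.0) / math.log(num_hypotheses))

num_supports = math.comb(20, 5)             # M = C(p, k) for p = 20, k = 5
no_information = fano_lower_bound(0.0, num_supports)
ample_information = fano_lower_bound(100.0, num_supports)
```

When the observations carry no information about the support, the bound is close to one; once the mutual information exceeds $\log M - 1$, the bound becomes vacuous, which is why the proof only needs to control how fast $I(\Omega; Y)$ can grow with $n$.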

We first bound the entropy of the observation vector $H(Y)$, using the fact that differential entropy
is maximized by the Gaussian distribution with a matched variance. More specifically, for a given
$X$, let $\Lambda(X)$ denote the covariance matrix of $Y$ conditioned on $X$. (Hence entry $\Lambda_{ii}(X)$ on the
diagonal represents the variance of $Y_i$.) With this notation, the entropy of $Y$ can be bounded as

$$H(Y) \leq \sum_{i=1}^{n} H(Y_i) \leq \sum_{i=1}^{n} \frac{1}{2} \log(2\pi e \, \Lambda_{ii}(X)).$$

Next, the entropy of the Gaussian noise vector $W \sim N(0, I_{n \times n})$ can be computed as $H(W) =
\frac{n}{2} \log(2\pi e)$. Combining these two terms, we then obtain the following bound on the mutual
information,

$$I(\Omega; Y) \leq \sum_{i=1}^{n} \frac{1}{2} \log(\Lambda_{ii}(X)).$$

With this bound on the mutual information, we now average the probability of error over the 
ensemble of measurement matrices X. Exploiting the concavity of the logarithm and applying 
Jensen's inequality, the average probability of error can be bounded as 

E X [Perr{X)\ > 1 . (21) 

lo g (*) 

It remains to compute the expectation $\mathbb{E}_X[\Lambda(X)]$ over the ensemble of matrices $X$ drawn with i.i.d. entries from any distribution with zero mean and unit variance. The proof of the following lemma involves some relatively straightforward but lengthy calculation, and is given in Appendix A.

Lemma 1. Given i.i.d. $X_{ij}$ with zero mean and unit variance, the average covariance is given by

$$\mathbb{E}_X[\Lambda(X)] = \Big(1 + k\beta_{\min}^2 \Big(1 - \frac{k}{p}\Big)\Big) I. \qquad (22)$$

Finally, combining Lemma 1 with equation (21), we obtain that the average probability of error is bounded away from zero if

$$n \;<\; \frac{\log \binom{p}{k} - 1}{\frac{1}{2} \log\Big(1 + k\beta_{\min}^2 \big(1 - \frac{k}{p}\big)\Big)},$$



as claimed. 



3.1.2 Applying Fano to restricted ensemble B 

The analysis of restricted ensemble B is completely analogous to the proof for restricted ensemble A, so we will only outline the key steps below. Let $\Omega$ denote a random variable with uniform distribution over the indices $\{1, \ldots, p - k + 1\}$. The probability of decoding error, for a given measurement matrix $X$, can be lower bounded by Fano's inequality as

$$P_{\mathrm{err}}(X) \;\ge\; 1 - \frac{I(\Omega; \tilde{Y}) + 1}{\log(p - k + 1)}.$$

As before, the key problem of bounding the mutual information $I(\Omega; \tilde{Y})$ between the random index $\Omega$ and the modified observation vector $\tilde{Y}$ can be reduced to bounding the entropy $H(\tilde{Y})$. For each fixed $X$, let $\Lambda(X)$ denote the covariance matrix of $\tilde{Y}$. Since the differential entropy of $\tilde{Y}_i$ is upper bounded by the entropy of a Gaussian distribution with variance $\Lambda_{ii}(X)$, we obtain the following bound on the mutual information

$$I(\Omega; \tilde{Y}) \;=\; H(\tilde{Y}) - \frac{n}{2} \log(2\pi e) \;\le\; \sum_{i=1}^{n} \frac{1}{2} \log\big(\Lambda_{ii}(X)\big).$$

Applying Jensen's inequality, we can then bound the average probability of error, averaged over 
the ensemble of measurement matrices X, as 

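$$\mathbb{E}_X\big[P_{\mathrm{err}}(X)\big] \;\ge\; 1 - \frac{\sum_{i=1}^{n} \frac{1}{2} \log\big(\mathbb{E}_X[\Lambda_{ii}(X)]\big) + 1}{\log(p - k + 1)}. \qquad (23)$$

The proof of Lemma 2 below follows the same steps as the derivation of Lemma 1 and is omitted.

Lemma 2. Given i.i.d. $X_{ij}$ with zero mean and unit variance, the average covariance is given by

$$\mathbb{E}_X[\Lambda(X)] = \Big(1 + \beta_{\min}^2 \Big(1 - \frac{1}{p - k + 1}\Big)\Big) I. \qquad (24)$$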



Finally, combining Lemma 2 with the Fano bound (23), we obtain that the average probability of error is bounded away from zero if

$$n \;<\; \frac{\log(p - k + 1) - 1}{\frac{1}{2} \log\Big(1 + \beta_{\min}^2 \big(1 - \frac{1}{p - k + 1}\big)\Big)},$$



as claimed. 
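The two thresholds derived in Sections 3.1.1 and 3.1.2 are easy to evaluate numerically. The following is a minimal sketch, not part of the original analysis: the function names `log_binom`, `f1`, `f2` and `dense_lower_bound` are illustrative, and natural logarithms are assumed throughout (so that entropies are in nats), which this section does not state explicitly.

```python
import math


def log_binom(p: int, k: int) -> float:
    """Natural log of the binomial coefficient C(p, k), computed via log-gamma."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)


def f1(p: int, k: int, beta_min: float) -> float:
    """Threshold from restricted ensemble A: (log C(p,k) - 1) / (0.5 log(1 + k b^2 (1 - k/p)))."""
    numerator = log_binom(p, k) - 1.0
    denominator = 0.5 * math.log(1.0 + k * beta_min**2 * (1.0 - k / p))
    return numerator / denominator


def f2(p: int, k: int, beta_min: float) -> float:
    """Threshold from restricted ensemble B: (log(p-k+1) - 1) / (0.5 log(1 + b^2 (1 - 1/(p-k+1))))."""
    m = p - k + 1
    numerator = math.log(m) - 1.0
    denominator = 0.5 * math.log(1.0 + beta_min**2 * (1.0 - 1.0 / m))
    return numerator / denominator


def dense_lower_bound(p: int, k: int, beta_min: float) -> float:
    """Larger of the two thresholds; error stays bounded away from zero if n is below it."""
    return max(f1(p, k, beta_min), f2(p, k, beta_min))


if __name__ == "__main__":
    print(dense_lower_bound(1000, 10, 1.0))
```

For a fixed $(p, k)$, decreasing `beta_min` shrinks both denominators, so both thresholds grow, matching the intuition that weaker signals need more observations.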



3.2 Proof of Theorem 2



This section contains proofs of the necessary conditions in Theorem 2 for the $\gamma$-sparsified Gaussian measurement ensemble (8). We proceed as before, applying Fano's inequality to restricted problems A and B, in order to derive the conditions $g_1(p, k, \beta_{\min}, \gamma)$ and $g_2(p, k, \beta_{\min}, \gamma)$, respectively.



3.2.1 Analyzing restricted ensemble A 

In analyzing the probability of error in restricted ensemble A, the initial steps proceed as in the proof of Theorem 1, first bounding the probability of error for a fixed instance of the measurement matrix $X$, and later averaging over the $\gamma$-sparsified Gaussian ensemble (8). Let $\Omega$ denote a random subset uniformly distributed over the $\binom{p}{k}$ possible subsets $S \subset \{1, \ldots, p\}$ of size $k$. As before, the probability of decoding error, for each fixed $X$, can be lower bounded by Fano's inequality as

$$P_{\mathrm{err}}(X) \;\ge\; 1 - \frac{I(\Omega; Y) + 1}{\log \binom{p}{k}}.$$

We can similarly bound the mutual information

$$I(\Omega; Y) \;=\; H(Y) - H(W) \;\le\; \sum_{i=1}^{n} H(Y_i) - \frac{n}{2} \log(2\pi e),$$

using the Gaussian entropy for $W \sim N(0, I_{n \times n})$.

From this point, the key subproblem is to compute the entropy of $Y_i = \sum_{j \in S} X_{ij} \beta_{\min} + W_i$. To characterize the limiting behavior of the random variable $Y_i$, note that $Y_i$ is distributed according to the density defined as

$$\tilde{\psi}_1(y, i; X) \;=\; \frac{1}{\binom{p}{k}} \sum_{S} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\Big(y - \beta_{\min} \sum_{j \in S} X_{ij}\Big)^2\Big).$$

For each fixed matrix $X$, this density is a mixture of Gaussians with unit variances and means that depend on the values of $\{X_{i1}, \ldots, X_{ip}\}$, summed over subsets $S \subset \{1, \ldots, p\}$ with $|S| = k$. At a high level, our immediate goal is to characterize the entropy $H(\tilde{\psi}_1)$.

Note that as $X$ varies over the ensemble (8), the sequence $\{\tilde{\psi}_1(y, i; X)\}_p$, indexed by the signal dimension $p$, is actually a sequence of random densities. As an intermediate step, the following lemma characterizes the average pointwise behavior of this random sequence of densities, and is proven in Appendix B.

Lemma 3. Let $X$ be drawn with i.i.d. entries from the $\gamma$-sparsified Gaussian ensemble (8). For each fixed $y$ and for all $i = 1, \ldots, n$, $\mathbb{E}_X[\tilde{\psi}_1(y, i; X)] = \psi_1(y)$, where

$$\psi_1(y) \;=\; \mathbb{E}_L\Bigg[\frac{1}{\sqrt{2\pi\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)}} \exp\Bigg(-\frac{y^2}{2\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)}\Bigg)\Bigg] \qquad (25)$$

is a mixture of Gaussians with binomial weights $L \sim \mathrm{Bin}(k, \gamma)$.



For certain scalings, we can use concentration results for U-statistics [20] to prove that $\tilde{\psi}_1$ converges uniformly to $\psi_1$, and from there that $H(\tilde{\psi}_1) \to H(\psi_1)$. In general, however, we always have an upper bound, which is sufficient for our purposes. Indeed, since differential entropy $H(\tilde{\psi}_1)$ is a concave function of $\tilde{\psi}_1$, by Jensen's inequality and Lemma 3 we have

$$\mathbb{E}_X\big[H(\tilde{\psi}_1)\big] \;\le\; H\big(\mathbb{E}_X[\tilde{\psi}_1]\big) \;=\; H(\psi_1).$$

With these ingredients, we conclude that the average error probability of any decoder, averaged 
over the sparsified Gaussian measurement ensemble, is lower bounded by 



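$$\mathbb{E}_X\big[P_{\mathrm{err}}(X)\big] \;\ge\; 1 - \frac{\sum_{i=1}^{n} \mathbb{E}_X\big[H(\tilde{\psi}_1)\big] - \frac{n}{2}\log(2\pi e) + 1}{\log \binom{p}{k}}$$

$$\ge\; 1 - \frac{\sum_{i=1}^{n} H\big(\mathbb{E}_X[\tilde{\psi}_1]\big) - \frac{n}{2}\log(2\pi e) + 1}{\log \binom{p}{k}}$$

$$=\; 1 - \frac{n H(\psi_1) - \frac{n}{2}\log(2\pi e) + 1}{\log \binom{p}{k}}.$$

Therefore, the probability of decoding error is bounded away from zero if

$$n \;<\; \frac{\log \binom{p}{k} - 1}{H(\psi_1) - \frac{1}{2}\log(2\pi e)},$$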

as claimed. 



3.2.2 Analyzing restricted ensemble B 

The analysis of restricted ensemble B mirrors exactly the derivation of restricted ensemble A. Hence we only outline the key steps in this section. Letting $\Omega \sim \mathrm{Uni}\{1, \ldots, p - k + 1\}$, we again apply Fano's inequality to restricted problem B, using the sparse measurement ensemble (8):

$$P_{\mathrm{err}}(X) \;\ge\; 1 - \frac{I(\Omega; \tilde{Y}) + 1}{\log(p - k + 1)}.$$



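In order to upper bound $I(\Omega; \tilde{Y})$, we need to upper bound the entropy $H(\tilde{Y})$. The sequence of densities associated with $\tilde{Y}_i$ becomes

$$\tilde{\psi}_2(y, i; X) \;=\; \frac{1}{p - k + 1} \sum_{j=1}^{p-k+1} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\big(y - \beta_{\min} X_{ij}\big)^2\Big).$$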



Lemma 4 below characterizes the average pointwise behavior of these densities, and follows from the proof of Lemma 3 with $S$ taken to be subsets of the indices $\{1, \ldots, p - k + 1\}$ of size $|S| = 1$.

Lemma 4. Let $X$ be drawn with i.i.d. entries according to (8). For each fixed $y$ and for all $i = 1, \ldots, n$, $\mathbb{E}_X[\tilde{\psi}_2(y, i; X)] = \psi_2(y)$, where

$$\psi_2(y) \;=\; \mathbb{E}_B\Bigg[\frac{1}{\sqrt{2\pi\big(1 + \frac{B\beta_{\min}^2}{\gamma}\big)}} \exp\Bigg(-\frac{y^2}{2\big(1 + \frac{B\beta_{\min}^2}{\gamma}\big)}\Bigg)\Bigg] \qquad (26)$$

is a mixture of Gaussians with Bernoulli weights $B \sim \mathrm{Ber}(\gamma)$.

As before, we can apply Jensen's inequality to obtain the bound 

$$\mathbb{E}_X\big[H(\tilde{\psi}_2)\big] \;\le\; H\big(\mathbb{E}_X[\tilde{\psi}_2]\big) \;=\; H(\psi_2).$$

The necessary condition then follows by the Fano bound on the probability of error. 

3.3 Proof of Corollary 2

In this section, we derive bounds on the expressions $g_1(p, k, \beta_{\min}, \gamma)$ and $g_2(p, k, \beta_{\min}, \gamma)$ in Theorem 2. We begin by noting that the Gaussian mixture distribution $\psi_1$ defined in (12) is a strict generalization of the distribution $\psi_2$ defined in (13); moreover, setting the parameter $k = 1$ in $\psi_1$ recovers $\psi_2$. The variance associated with the mixture distribution $\psi_1$ is equal to $\sigma_1^2 = 1 + k\beta_{\min}^2$, and so the entropy of $\psi_1$ is always bounded by the entropy of a Gaussian distribution with variance $\sigma_1^2$, as

$$H(\psi_1) \;\le\; \frac{1}{2} \log\big(2\pi e (1 + k\beta_{\min}^2)\big).$$

Similarly, the mixture distribution $\psi_2$ has variance equal to $1 + \beta_{\min}^2$, so that the entropy associated with $\psi_2$ can in general be bounded as

$$H(\psi_2) \;\le\; \frac{1}{2} \log\big(2\pi e (1 + \beta_{\min}^2)\big).$$



This yields the first set of bounds in (16). 

Next, to derive more refined bounds which capture the effects of measurement sparsity, we will make use of the following lemma (which is proven in Appendix C) to bound the entropy associated with the mixture distribution $\psi_1$.

Lemma 5. For the Gaussian mixture distribution $\psi_1$ defined in (12),

$$H(\psi_1) \;\le\; \mathbb{E}_L\Big[\frac{1}{2} \log\Big(2\pi e \Big(1 + \frac{L\beta_{\min}^2}{\gamma}\Big)\Big)\Big] + H(L),$$

where $L \sim \mathrm{Bin}(k, \gamma)$.

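We can further bound the expression in Lemma 5 in three cases, delineated by the quantity $\gamma k$. The proof of the following claim is given in Appendix D.

Lemma 6. Let $E = \mathbb{E}_L\big[\frac{1}{2} \log\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)\big]$, where $L \sim \mathrm{Bin}(k, \gamma)$.

(a) If $\gamma k > 3$, then

$$\frac{1}{4} \log\Big(1 + \frac{k\beta_{\min}^2}{3}\Big) \;\le\; E \;\le\; \frac{1}{2} \log\big(1 + k\beta_{\min}^2\big).$$

(b) If $\gamma k = \tau$ for some constant $\tau$, then

$$\frac{1}{2}\big(1 - e^{-\tau}\big) \log\Big(1 + \frac{k\beta_{\min}^2}{\tau}\Big) \;\le\; E \;\le\; \frac{1}{2}\,\tau \log\Big(1 + \frac{k\beta_{\min}^2}{\tau}\Big).$$

(c) If $\gamma k \le 1$, then

$$\frac{1}{4}\,\gamma k \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \;\le\; E \;\le\; \frac{1}{2}\,\gamma k \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big).$$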



Finally, combining Lemmas 5 and 6 with some simple bounds on the entropy of the binomial variate $L$ (given in Appendix E), we obtain the bounds on $g_1(p, k, \beta_{\min}, \gamma)$ in (17) and (18).
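The three regimes of Lemma 6 can be sanity-checked numerically by computing $E$ exactly from the binomial probabilities. The sketch below is an illustration only: the function names are hypothetical, natural logarithms are assumed, and the parameter values in the loop are arbitrary choices made to land in each regime.

```python
import math


def binom_pmf(k: int, gamma: float, l: int) -> float:
    """P(L = l) for L ~ Bin(k, gamma)."""
    return math.comb(k, l) * gamma**l * (1.0 - gamma) ** (k - l)


def e_exact(k: int, gamma: float, beta_min: float) -> float:
    """E = E_L[ 0.5 * log(1 + L * beta_min^2 / gamma) ], summed over l = 0..k."""
    return sum(
        0.5 * math.log(1.0 + l * beta_min**2 / gamma) * binom_pmf(k, gamma, l)
        for l in range(k + 1)
    )


def lemma6_bounds(k: int, gamma: float, beta_min: float):
    """(lower, upper) bounds on E, picking case (a), (c) or (b) from gamma * k."""
    tau = gamma * k
    b2 = beta_min**2
    if tau > 3:  # case (a)
        return 0.25 * math.log(1.0 + k * b2 / 3.0), 0.5 * math.log(1.0 + k * b2)
    if tau <= 1:  # case (c)
        t = tau * math.log(1.0 + b2 / gamma)
        return 0.25 * t, 0.5 * t
    # case (b): gamma * k = tau treated as a constant
    lg = math.log(1.0 + k * b2 / tau)
    return 0.5 * (1.0 - math.exp(-tau)) * lg, 0.5 * tau * lg


if __name__ == "__main__":
    for k, gamma in [(50, 0.2), (50, 0.04), (50, 0.01)]:
        lo, hi = lemma6_bounds(k, gamma, 1.0)
        print(k, gamma, lo, e_exact(k, gamma, 1.0), hi)
```

In each of the three printed rows the exact value should sit between the two bounds; how tight the sandwich is depends on the regime.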

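We can similarly bound the entropy associated with the Gaussian mixture distribution $\psi_2$. Since the density $\psi_2$ is a special case of the density $\psi_1$ with $k$ set to 1, we can again apply Lemma 5 to obtain

$$H(\psi_2) \;\le\; \mathbb{E}_B\Big[\frac{1}{2} \log\Big(2\pi e \Big(1 + \frac{B\beta_{\min}^2}{\gamma}\Big)\Big)\Big] + H(B) \;=\; \frac{\gamma}{2} \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) + H_{\mathrm{binary}}(\gamma) + \frac{1}{2} \log(2\pi e).$$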



We have thus obtained the bounds on $g_2(p, k, \beta_{\min}, \gamma)$ in equations (17) and (18).

4 Discussion



In this paper, we have studied the information-theoretic limits of exact support recovery for general scalings of the parameters $(n, p, k, \beta_{\min}, \gamma)$. Our first result (Theorem 1) applies generally to measurement matrices with zero-mean and unit-variance entries. It strengthens previously known bounds and, combined with known sufficient conditions [23], yields a sharp characterization of recovering signals with linear sparsity with a linear fraction of observations (Corollary 2). Our second result (Theorem 2) applies to $\gamma$-sparsified Gaussian measurement ensembles, and reveals three different regimes of measurement sparsity, depending on how significantly they impair statistical efficiency. For linear signal sparsity, Theorem 2 is not a sharp result (by comparison to Theorem 1 in the dense case); however, its tightness for sublinear signal sparsity is an interesting open problem. Finally, Theorem 1 implies that the standard Gaussian ensemble is an information-theoretically optimal choice for the measurement matrix: no other zero-mean unit-variance distribution can reduce the number of observations necessary for recovery, and in fact the standard Gaussian distribution achieves matching sufficient bounds [24]. This fact raises an interesting open question on the design of other, more computationally friendly, measurement matrices which are optimal in the information-theoretic sense.

A Proof of Lemma 1

We begin by defining some additional notation. Let $\beta \in \mathbb{R}^p$ be a $k$-sparse vector with $\beta_j = \beta_{\min}$ for all indices $j$ in the support set $S$. Recall that $\Omega$ denotes a random subset uniformly distributed over all $\binom{p}{k}$ possible subsets $S \subset \{1, \ldots, p\}$ with $|S| = k$. Conditioned on the event that $\Omega = S$, the vector of $n$ observations can then be written as

$$Y := X_S \beta_S + W = \beta_{\min} \sum_{j \in S} X_j + W,$$

where $X_j$ denotes the $j$-th column of $X$.

Note that for a given instance of the matrix $X$, the distribution of $Y$ is a Gaussian mixture with density $f(y) = \frac{1}{\binom{p}{k}} \sum_S \varphi(X_S \beta_S, I)$, where we are using $\varphi$ to denote the density of a Gaussian random vector with mean $X_S \beta_S$ and covariance $I$. Let $\mu(X) = \mu \in \mathbb{R}^n$ and $\Lambda(X) = \Lambda \in \mathbb{R}^{n \times n}$ be the mean vector and covariance matrix of $Y$, respectively. The covariance matrix of $Y$ can be computed as $\Lambda = \mathbb{E}[YY^T] - \mu\mu^T$, where

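$$\mu \;=\; \frac{1}{\binom{p}{k}} \sum_S X_S \beta_S \;=\; \frac{\beta_{\min}}{\binom{p}{k}} \sum_S \sum_{j \in S} X_j$$

and

$$\mathbb{E}[YY^T] \;=\; \mathbb{E}\big[(X\beta)(X\beta)^T\big] + \mathbb{E}\big[WW^T\big] \;=\; \frac{1}{\binom{p}{k}} \sum_S (X_S\beta_S)(X_S\beta_S)^T + I.$$

With this notation, we can now compute the expectation of the covariance matrix $\mathbb{E}_X[\Lambda]$, averaged over any distribution on $X$ with independent, zero-mean and unit-variance entries. To compute the first term, we have

$$\mathbb{E}_X\big[\mathbb{E}[YY^T]\big] \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}} \sum_S \mathbb{E}_X\Big[\sum_{j \in S} X_j X_j^T + \sum_{i \ne j \in S} X_i X_j^T\Big] + I \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}} \sum_S \sum_{j \in S} I + I \;=\; \big(1 + k\beta_{\min}^2\big) I,$$

where the second equality uses the fact that $\mathbb{E}_X[X_j X_j^T] = I$, and $\mathbb{E}_X[X_i X_j^T] = 0$ for $i \ne j$. Next, we compute the second term as

$$\mathbb{E}_X\big[\mu\mu^T\big] \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}^2}\, \mathbb{E}_X\Big[\sum_{S, U} \Big(\sum_{j \in S \cap U} X_j X_j^T + \sum_{i \in S,\, j \in U,\, i \ne j} X_i X_j^T\Big)\Big] \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}^2} \sum_{S, U} \sum_{j \in S \cap U} I \;=\; \Big(\frac{\beta_{\min}^2}{\binom{p}{k}^2} \sum_{S, U} |S \cap U|\Big) I.$$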



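From here, note that there are $\binom{p}{k}$ possible subsets $S$. For each $S$, a counting argument reveals that there are $\binom{k}{\lambda}\binom{p-k}{k-\lambda}$ subsets $U$ of size $k$ which have $\lambda = |S \cap U|$ overlaps with $S$. Thus the scalar multiplicative factor above can be written as

$$\frac{\beta_{\min}^2}{\binom{p}{k}^2} \sum_{S, U} |S \cap U| \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}} \sum_{\lambda=1}^{k} \lambda \binom{k}{\lambda} \binom{p-k}{k-\lambda}.$$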


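Finally, using a substitution of variables (by setting $\lambda' = \lambda - 1$) and applying Vandermonde's identity [17], we have

$$\frac{\beta_{\min}^2}{\binom{p}{k}^2} \sum_{S, U} |S \cap U| \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}}\, k \sum_{\lambda'=0}^{k-1} \binom{k-1}{\lambda'} \binom{p-k}{k-\lambda'-1} \;=\; \frac{\beta_{\min}^2}{\binom{p}{k}}\, k \binom{p-1}{k-1} \;=\; \frac{k^2 \beta_{\min}^2}{p}.$$

Combining these terms, we conclude that

$$\mathbb{E}_X[\Lambda(X)] \;=\; \Big(1 + k\beta_{\min}^2 \Big(1 - \frac{k}{p}\Big)\Big) I.$$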
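The counting step in this appendix can be checked exactly for small problem sizes. The sketch below (function names are illustrative and not from the paper) compares the average overlap $\binom{p}{k}^{-2}\sum_{S,U}|S \cap U|$, computed both by brute-force enumeration and by the overlap-count formula, against the closed form $k^2/p$, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def mean_overlap_formula(p: int, k: int) -> Fraction:
    """(1 / C(p,k)) * sum_{lam=1..k} lam * C(k,lam) * C(p-k, k-lam)."""
    total = sum(lam * comb(k, lam) * comb(p - k, k - lam) for lam in range(1, k + 1))
    return Fraction(total, comb(p, k))


def mean_overlap_bruteforce(p: int, k: int) -> Fraction:
    """Average of |S & U| over all ordered pairs of size-k subsets of {0..p-1}."""
    subsets = [frozenset(s) for s in combinations(range(p), k)]
    total = sum(len(s & u) for s in subsets for u in subsets)
    return Fraction(total, len(subsets) ** 2)


def avg_cov_diagonal(p: int, k: int, beta_min: float) -> float:
    """Diagonal entry of E_X[Lambda(X)]: 1 + k * beta_min^2 * (1 - k/p)."""
    return 1.0 + k * beta_min**2 * (1.0 - k / p)


if __name__ == "__main__":
    for p, k in [(6, 2), (7, 3), (8, 4)]:
        print(p, k, mean_overlap_formula(p, k), mean_overlap_bruteforce(p, k), Fraction(k * k, p))
```

Brute force is only feasible for small $p$, since the number of subset pairs grows as $\binom{p}{k}^2$; the formula version scales to any size.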
B Proof of Lemma 3

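Consider the following sequences of densities,

$$\tilde{\psi}_1(y, i; X) \;=\; \frac{1}{\binom{p}{k}} \sum_{S} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\Big(y - \beta_{\min} \sum_{j \in S} X_{ij}\Big)^2\Big)$$

and

$$\psi_1(y) \;=\; \mathbb{E}_L\Bigg[\frac{1}{\sqrt{2\pi\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)}} \exp\Bigg(-\frac{y^2}{2\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)}\Bigg)\Bigg],$$

where $L \sim \mathrm{Bin}(k, \gamma)$. Our goal is to show that for each fixed $y$ and row index $i$, the pointwise average of the stochastic sequence of densities $\tilde{\psi}_1$ over the ensemble of matrices $X$ satisfies $\mathbb{E}_X[\tilde{\psi}_1(y, i; X)] = \psi_1(y)$. By symmetry, it is sufficient to compute this expectation for the subset $S = \{1, \ldots, k\}$. When each $X_{ij}$ is drawn i.i.d. according to the $\gamma$-sparsified ensemble (8), the random variable $Z = \big(y - \beta_{\min} \sum_{j=1}^{k} X_{ij}\big)$ has a Gaussian mixture distribution which can be described as follows. Denoting the mixture label by $L \sim \mathrm{Bin}(k, \gamma)$, then $Z \sim N\big(y, \frac{\ell\beta_{\min}^2}{\gamma}\big)$ if $L = \ell$, for $\ell = 0, \ldots, k$. Thus, conditioned on the mixture label $L = \ell$, the random variable $\tilde{Z} = \frac{\gamma}{\ell\beta_{\min}^2}\big(y - \beta_{\min} \sum_{j=1}^{k} X_{ij}\big)^2$ has a noncentral chi-square distribution with 1 degree of freedom and parameter $\lambda = \frac{\gamma y^2}{\ell\beta_{\min}^2}$. Evaluating $M_\ell(t) = \mathbb{E}[e^{t\tilde{Z}}]$, the moment-generating function [3] of $\tilde{Z}$, at $t = -\frac{\ell\beta_{\min}^2}{2\gamma}$ then gives us the desired quantity,

$$\mathbb{E}_X\Big[\frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\Big(y - \beta_{\min} \sum_{j=1}^{k} X_{ij}\Big)^2\Big)\Big] \;=\; \sum_{\ell=0}^{k} \mathbb{E}_X\Big[\frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\Big(y - \beta_{\min} \sum_{j=1}^{k} X_{ij}\Big)^2\Big) \,\Big|\, L = \ell\Big]\, P(L = \ell)$$

$$=\; \sum_{\ell=0}^{k} \frac{1}{\sqrt{2\pi\big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\big)}} \exp\Bigg(-\frac{y^2}{2\big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\big)}\Bigg) P(L = \ell),$$

as claimed.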
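The key step in the proof of Lemma 3 is that, conditioned on $L = \ell$, averaging a unit-variance Gaussian density over a random mean shift $s = \beta_{\min} \sum_j X_{ij} \sim N(0, a)$, with $a = \ell\beta_{\min}^2/\gamma$, yields a zero-mean Gaussian density with variance $1 + a$. The following sketch checks this identity by direct numerical integration rather than through the moment-generating function used in the text; the function names, integration range and step count are arbitrary choices made for illustration.

```python
import math


def normal_pdf(x: float, var: float) -> float:
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def averaged_density(y: float, a: float, half_width: float = 12.0, steps: int = 20000) -> float:
    """Trapezoid approximation of E_s[ N(y; s, 1) ] for a random shift s ~ N(0, a)."""
    sd = math.sqrt(a)
    lo = -half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        s = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(y - s, 1.0) * normal_pdf(s, a)
    return total * h


if __name__ == "__main__":
    for a in (0.5, 4.0):
        for y in (0.0, 0.7, 2.5):
            print(a, y, averaged_density(y, a), normal_pdf(y, 1.0 + a))
```

Weighting the right-hand column by the binomial probabilities $P(L = \ell)$ and summing over $\ell$ gives the mixture $\psi_1(y)$.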



C Proof of Lemma 5

Let $Z$ be a random variable distributed according to the density

$$\psi_1(y) \;=\; \mathbb{E}_L\Bigg[\frac{1}{\sqrt{2\pi\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)}} \exp\Bigg(-\frac{y^2}{2\big(1 + \frac{L\beta_{\min}^2}{\gamma}\big)}\Bigg)\Bigg],$$

where $L \sim \mathrm{Bin}(k, \gamma)$. To compute the entropy of $Z$, we can expand the following mutual information in two ways, $I(Z; L) = H(Z) - H(Z \mid L) = H(L) - H(L \mid Z)$, and obtain

$$H(Z) = H(Z \mid L) + H(L) - H(L \mid Z).$$

The conditional distribution of $Z$ given that $L = \ell$ is Gaussian, and so the conditional entropy of $Z$ given $L$ can be written as

$$H(Z \mid L) \;=\; \mathbb{E}_L\Big[\frac{1}{2} \log\Big(2\pi e \Big(1 + \frac{L\beta_{\min}^2}{\gamma}\Big)\Big)\Big].$$

Furthermore, we can bound the conditional entropy of $L$ given $Z$ as $0 \le H(L \mid Z) \le H(L)$. This gives upper and lower bounds on the entropy of $Z$ as

$$H(Z \mid L) \;\le\; H(Z) \;\le\; H(Z \mid L) + H(L).$$
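The sandwich on $H(Z)$ can be illustrated numerically for a small instance. The sketch below is an assumption-laden illustration: the function names are hypothetical, entropies are in nats, and $H(Z)$ is approximated by a trapezoid rule on a truncated grid (12 standard deviations of the widest mixture component), so it is accurate only up to quadrature error.

```python
import math


def binom_pmf_list(k: int, gamma: float):
    """Probabilities P(L = l), l = 0..k, for L ~ Bin(k, gamma)."""
    return [math.comb(k, l) * gamma**l * (1.0 - gamma) ** (k - l) for l in range(k + 1)]


def mixture_entropy_terms(k: int, gamma: float, beta_min: float, steps: int = 40000):
    """Return (H(Z|L), H(Z), H(L)) for Z distributed according to psi_1."""
    pmf = binom_pmf_list(k, gamma)
    variances = [1.0 + l * beta_min**2 / gamma for l in range(k + 1)]

    def psi1(y: float) -> float:
        return sum(
            w * math.exp(-y * y / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for w, v in zip(pmf, variances)
        )

    # Differential entropy H(Z) = -integral psi1 * log(psi1), by the trapezoid rule.
    half = 12.0 * math.sqrt(variances[-1])
    h = 2.0 * half / steps
    h_z = 0.0
    for i in range(steps + 1):
        d = psi1(-half + i * h)
        if d > 0.0:
            weight = 0.5 if i in (0, steps) else 1.0
            h_z -= weight * d * math.log(d) * h

    h_z_given_l = sum(
        w * 0.5 * math.log(2.0 * math.pi * math.e * v) for w, v in zip(pmf, variances)
    )
    h_l = -sum(w * math.log(w) for w in pmf if w > 0.0)
    return h_z_given_l, h_z, h_l


if __name__ == "__main__":
    lower, h_z, h_l = mixture_entropy_terms(5, 0.3, 1.0)
    print(lower, h_z, lower + h_l)
```

The same numerical $H(Z)$ can also be compared with the cruder Gaussian bound $\frac{1}{2}\log(2\pi e(1 + k\beta_{\min}^2))$ from Section 3.3.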

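D Proof of Lemma 6

We first derive upper and lower bounds in the case when $\gamma k \le 1$. We can rewrite the binomial distribution as

$$p(\ell) := \binom{k}{\ell} \gamma^{\ell} (1 - \gamma)^{k - \ell} \;=\; \frac{\gamma k}{\ell} \binom{k-1}{\ell-1} \gamma^{\ell-1} (1 - \gamma)^{k - \ell},$$

and hence

$$E \;=\; \frac{1}{2} \sum_{\ell=1}^{k} \log\Big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Big) p(\ell) \;=\; \frac{1}{2}\,\gamma k \sum_{\ell=1}^{k} \frac{\log\big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\big)}{\ell} \binom{k-1}{\ell-1} \gamma^{\ell-1} (1 - \gamma)^{k - \ell}.$$

Taking the first two terms of the binomial expansion of $\big(1 + \frac{\beta_{\min}^2}{\gamma}\big)^{\ell}$ and noting that all the terms are non-negative, we obtain the inequality

$$\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big)^{\ell} \;\ge\; 1 + \frac{\ell\beta_{\min}^2}{\gamma},$$

and consequently $\log\big(1 + \frac{\beta_{\min}^2}{\gamma}\big) \ge \frac{1}{\ell} \log\big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\big)$. Using a change of variables (by setting $\ell' = \ell - 1$) and applying the binomial theorem, we thus obtain the upper bound

$$E \;\le\; \frac{1}{2}\,\gamma k \sum_{\ell=1}^{k} \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \binom{k-1}{\ell-1} \gamma^{\ell-1} (1 - \gamma)^{k-\ell} \;=\; \frac{1}{2}\,\gamma k \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \sum_{\ell'=0}^{k-1} \binom{k-1}{\ell'} \gamma^{\ell'} (1 - \gamma)^{k-\ell'-1} \;=\; \frac{1}{2}\,\gamma k \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big).$$

To derive the lower bound, we will use the fact that $1 + x \le e^{x}$ for all $x \in \mathbb{R}$, and $e^{-x} \le 1 - \frac{x}{2}$ for $x \in [0, 1]$:

$$E \;=\; \frac{1}{2} \sum_{\ell=1}^{k} \log\Big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Big) p(\ell) \;\ge\; \frac{1}{2} \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \sum_{\ell=1}^{k} p(\ell) \;=\; \frac{1}{2} \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \big(1 - (1 - \gamma)^{k}\big)$$

$$\ge\; \frac{1}{2} \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \big(1 - e^{-\gamma k}\big) \;\overset{(a)}{\ge}\; \frac{1}{2} \log\Big(1 + \frac{\beta_{\min}^2}{\gamma}\Big) \Big(\frac{\gamma k}{2}\Big).$$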

Next, we examine the case when $\gamma k = \tau$ for some constant $\tau$. The derivation of the upper bound in the case when $\gamma k \le 1$ holds for the $\gamma k = \tau$ case as well. The proof of the lower bound follows the same steps as in the $\gamma k \le 1$ case, except that we stop before applying the last inequality (a).

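Finally, we derive bounds in the case when $\gamma k > 3$. Since the mean of a $L \sim \mathrm{Bin}(k, \gamma)$ random variable is $\gamma k$, by Jensen's inequality the following bound always holds,

$$\mathbb{E}_L\Big[\frac{1}{2} \log\Big(1 + \frac{L\beta_{\min}^2}{\gamma}\Big)\Big] \;\le\; \frac{1}{2} \log\big(1 + k\beta_{\min}^2\big).$$

To derive a matching lower bound, we use the fact that the median of a $\mathrm{Bin}(k, \gamma)$ distribution is one of $\{\lfloor \gamma k \rfloor - 1, \lfloor \gamma k \rfloor, \lfloor \gamma k \rfloor + 1\}$. This allows us to bound

$$E \;\ge\; \frac{1}{2} \sum_{\ell = \mathrm{median}}^{k} \log\Big(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Big) p(\ell) \;\ge\; \frac{1}{2} \log\Big(1 + \frac{(\lfloor \gamma k \rfloor - 1)\beta_{\min}^2}{\gamma}\Big) \sum_{\ell = \mathrm{median}}^{k} p(\ell) \;\ge\; \frac{1}{4} \log\Big(1 + \frac{k\beta_{\min}^2}{3}\Big),$$

where in the last step we used the fact that $\frac{(\lfloor \gamma k \rfloor - 1)\beta_{\min}^2}{\gamma} \ge \frac{(\gamma k - 2)\beta_{\min}^2}{\gamma} \ge \frac{k\beta_{\min}^2}{3}$ for $\gamma k > 3$, and $\sum_{\ell = \mathrm{median}}^{k} p(\ell) \ge \frac{1}{2}$.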

E Bounds on binomial entropy 

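Lemma 7. Let $L \sim \mathrm{Bin}(k, \gamma)$. Then

$$H(L) \;\le\; k H_{\mathrm{binary}}(\gamma).$$

Furthermore, if $\gamma = \frac{1}{k f(k)}$ with $f(k) = \omega(\log k)$, then $k H_{\mathrm{binary}}(\gamma) \to 0$ as $k \to \infty$.

Proof. We can express the binomial variate as $L = \sum_{i=1}^{k} Z_i$, where $Z_i \sim \mathrm{Ber}(\gamma)$ i.i.d. Since $H(g(Z_1, \ldots, Z_k)) \le H(Z_1, \ldots, Z_k)$, we have

$$H(L) \;\le\; H(Z_1, \ldots, Z_k) \;=\; k H_{\mathrm{binary}}(\gamma).$$

Next we find the limit of $k H_{\mathrm{binary}}(\gamma) = k\gamma \log\frac{1}{\gamma} + k(1 - \gamma) \log\frac{1}{1 - \gamma}$. Let $\gamma = \frac{1}{k f(k)}$, and assume that $f(k) \to \infty$ as $k \to \infty$. Hence the first term can be written as

$$k\gamma \log\frac{1}{\gamma} \;=\; \frac{1}{f(k)} \log\big(k f(k)\big) \;=\; \frac{\log k}{f(k)} + \frac{\log f(k)}{f(k)},$$

and so $k\gamma \log\frac{1}{\gamma} \to 0$ if $f(k) = \omega(\log k)$. The second term can also be expanded as

$$-k(1 - \gamma) \log(1 - \gamma) \;=\; -k \log\Big(1 - \frac{1}{k f(k)}\Big) + \frac{1}{f(k)} \log\Big(1 - \frac{1}{k f(k)}\Big) \;=\; -\log\Big(1 - \frac{1}{k f(k)}\Big)^{k} + \frac{1}{f(k)} \log\Big(1 - \frac{1}{k f(k)}\Big).$$

If $f(k) \to \infty$ as $k \to \infty$, then we have the limits

$$\lim_{k \to \infty} \Big(1 - \frac{1}{k f(k)}\Big)^{k} = 1 \quad \text{and} \quad \lim_{k \to \infty} \Big(1 - \frac{1}{k f(k)}\Big) = 1,$$

which in turn imply that

$$\lim_{k \to \infty} \log\Big(1 - \frac{1}{k f(k)}\Big)^{k} = 0 \quad \text{and} \quad \lim_{k \to \infty} \frac{1}{f(k)} \log\Big(1 - \frac{1}{k f(k)}\Big) = 0. \qquad \square$$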

Lemma 8. Let $L \sim \mathrm{Bin}(k, \gamma)$. Then

$$H(L) \;\le\; \frac{1}{2} \log\Big(2\pi e \Big(k\gamma(1 - \gamma) + \frac{1}{12}\Big)\Big).$$

Proof. We immediately obtain this bound by applying the differential entropy bound on discrete entropy [8]. □
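Both binomial entropy bounds are easy to compare against the exact entropy for moderate $k$. The sketch below is illustrative only: the function names are hypothetical, entropies are in nats, and the choice $f(k) = (\log k)^2$ in the last lines is just one example of a function growing faster than $\log k$.

```python
import math


def binary_entropy(g: float) -> float:
    """H_binary(g) in nats, with the convention 0 * log(0) = 0."""
    if g <= 0.0 or g >= 1.0:
        return 0.0
    return -g * math.log(g) - (1.0 - g) * math.log(1.0 - g)


def binomial_entropy(k: int, g: float) -> float:
    """Exact entropy of L ~ Bin(k, g), summed term by term."""
    probs = (math.comb(k, l) * g**l * (1.0 - g) ** (k - l) for l in range(k + 1))
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def lemma7_bound(k: float, g: float) -> float:
    """k * H_binary(g)."""
    return k * binary_entropy(g)


def lemma8_bound(k: float, g: float) -> float:
    """0.5 * log(2 * pi * e * (k * g * (1 - g) + 1/12))."""
    return 0.5 * math.log(2.0 * math.pi * math.e * (k * g * (1.0 - g) + 1.0 / 12.0))


if __name__ == "__main__":
    for k, g in [(20, 0.3), (100, 0.01), (50, 0.5)]:
        print(k, g, binomial_entropy(k, g), lemma7_bound(k, g), lemma8_bound(k, g))
    # Decay of k * H_binary(gamma) when gamma = 1 / (k * f(k)) with f(k) = (log k)^2.
    for k in (1e3, 1e6):
        print(k, lemma7_bound(k, 1.0 / (k * math.log(k) ** 2)))
```

For the parameter values above, the Lemma 8 bound is noticeably tighter than the Lemma 7 bound, while Lemma 7 is the one that drives the vanishing-entropy statement for very sparse $\gamma$.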